Abstract: This work aims at solving issues in distributed machine learning. We propose three
directions to work on. First, we design solutions to speed up the alternating direction method of
multipliers (ADMM) for distributed data. Second, we focus on a client-server learning scenario
in which an online, semi-supervised learning approach is designed to reduce the
communication load. Finally, we propose parallel least-squares policy iteration (parallel
LSPI) to parallelize policy learning in reinforcement learning.

Introduction:

Speeding  up  ADMM 

We consider training a classifier on large, distributed data. Since the data reside on
different machines and are too big to be relocated, an efficient way to train a
classifier is to develop a distributed optimization method. The alternating direction method of
multipliers (ADMM) can be exploited to solve this problem: each local machine
updates its learned model and the master machine tries to reach a consensus. Specifically, we
consider g machines, where each machine m learns a model $w_m$ from
its local data $N_m$. The learned models $w_m$ are required to agree with each other to form a
consensus $c$. Mathematically, we are interested in solving the following linear classification
problem,


$$
\begin{aligned}
\underset{w_1, \ldots, w_g,\, c}{\text{minimize}} \quad & C \sum_{m=1}^{g} \sum_{i \in N_m} \max(1 - w_m^T z_i,\, 0)^2 + \frac{1}{2} c^T c \\
\text{s.t.} \quad & w_m - c = 0, \quad m = 1, \ldots, g,
\end{aligned}
\tag{1}
$$

where $z_i$ is a label/feature pair of the i-th sample on machine m, and $C$ controls the
regularization effect. Although squared hinge loss is shown here, the objective is in fact
general and other types of loss function can be utilized. For example, if we use squared loss,
the objective becomes a regression problem.

Some related works use ADMM for distributed data. For example, [Zhang et al.
2012] implemented ADMM to solve this problem. The update of $w$ at an iteration of ADMM turns
out to be similar to the standard SVM formulation. Thus, [Zhang et al. 2012] adapted
stochastic dual coordinate ascent, which is implemented in the well-known LIBSVM
toolkit, to solve it. As a result, several "inner" iterations are required to obtain $w_m$ at each
iteration of ADMM. When the data on each local machine are also large, the computational
cost of one pass over the full data increases drastically, which makes the update of $w_m$ consume a
substantial amount of time. We aim to deal with this issue and propose two methods.

The first approach leverages the idea of the stochastic gradient descent method (SGD)
[Bottou et al. 2010] for ADMM. SGD has shown its merits in solving large-scale
optimization problems. At each iteration, it samples an instance to compute the gradient
instead of using all the data as in the traditional gradient descent method. Several works have
shown that SGD can solve large-scale problems more efficiently than gradient
descent [Bottou et al. 2010]. It is known that SGD needs only a few samples to
achieve sufficient descent of the objective at the beginning. To achieve a similar effect in
ADMM, during the first few iterations each local machine uses only a subset of its data, instead of
all the data, to update its model. For example, at the first iteration, each machine
samples half of its data to compute $w_m$. The sample size is then increased at each subsequent
iteration until each machine uses all of its local data. Because the early iterations are
cheaper, the method has the potential to converge faster in terms of time. More importantly,
we provide a theorem guaranteeing that the method enjoys the same convergence rate, in terms
of the number of iterations, as standard ADMM.

Following the idea of sampling a subset of the data, instead of the full data, to update the model
before each round of communication, we convert objective (1) to the dual domain and propose our
second approach, which performs ADMM in the dual domain. As each dual variable
corresponds to a sample, sampling a subset to update the model becomes easier and more
natural than in the primal domain. The algorithm for performing ADMM on the dual of (1) turns
out to be equivalent to SDCA-ADMM [Suzuki 2014], which was originally proposed for solving
objectives with complex regularizations (e.g., group lasso [Jacob et al. 2009] and graph-guided
SVM [Ouyang et al. 2013]) by combining ADMM and stochastic dual coordinate ascent
(SDCA). However, these works do not consider generalizing the method to distributed data.
We show that the algorithm can solve optimization problems on distributed data as well.

To summarize, our contributions are 1) proposing a simple, easy-to-implement, yet effective
way to accelerate ADMM on distributed data with a theoretical guarantee, 2) proposing
to run ADMM on the dual of the objective and showing the advantages of doing so, and 3)
showing the effectiveness of our methods on several datasets.

Communication-Efficient  Online  Semi-Supervised  Learning  in  Client-Server  Settings 
This work considers a setting in which a set of distributed clients each generate an ongoing
stream of data and a server seeks to learn a model of the data. We impose two practical
limitations on the setting. First, because of the cost of having humans label large quantities
of data, we assume that only a small fraction of the data are labeled. In particular, we focus on
a setting where only the first portion (e.g., 2%) of the training data are labeled. Second, because
communication bandwidth is often expensive and battery-draining (e.g., for a mobile device on a
cellular network), we seek communication-efficient solutions in which each client is limited
to sending the server only a small fraction of the unlabeled data it generates, and is limited in
how much information it receives from the server.

An elegant solution to these problems faces several challenges. First, the amount of data
generated by clients can be huge, and even potentially unlimited. As a result, the vast majority
of data on the server are unlabeled, and limited labeled data alone are typically not sufficient
to train a model with good generalization ability. Second, when the volume and
velocity of data are high, it is very costly, if not impossible, to store all data either on the
clients or on the server. Thus, traditional approaches that first store data and then train on a
static collection are not appropriate in this case. Third, transmitting massive data over the
network is discouraged in practice, especially when the network bandwidth is restricted or
communication is expensive (e.g., on a cellular network). Such traffic may also be misclassified
as a denial-of-service attack and dropped or blocked.

By considering online, semi-supervised, and active learning jointly, our goal is to develop a
modular framework for learning from a remote, partially labeled data stream while reducing
bandwidth consumption. We present a novel framework for solving this learning problem
in an effective and communication-efficient manner (see Figure 1). On the server side, our
solution combines two diverse learners working collaboratively, yet in distinct roles, on the
partially labeled data stream. A compact, online graph-based semi-supervised learner is used
to predict labels for the unlabeled data arriving from the clients. Specifically, we adapt the
Harmonic Solution learner to online use via an incremental k-center clustering approach that
maintains the graph structure solely on a set of k centroid nodes. Random samples are then
repeatedly drawn from the model according to the confidence of its predictions, and used to
train a second learner on the server, a linear classifier (specifically, a soft
confidence-weighted classifier). The second learner updates its hypothesis based on these
samples and their predicted labels. We show how these two learners can be combined in an
optimization problem. On the client side, our solution prioritizes data based on an
active-learning metric that favors instances that are more uncertain (i.e., close to the
classifier's decision hyperplane) and yet far from each other (as measured by covariance). To
reduce communication, the server sends the classifier's weight vector to the client only
periodically. At any point in time, the classifier can be used as a standalone model for
predicting labels for new test data.

Parallel  least-squares  policy  iteration 

Learning an optimal policy for MDPs with large state spaces has attracted much interest
recently. Unlike previous works, our proposed method is inspired by the recent
success of distributed optimization. The goal is to parallelize an existing policy iteration
method called least-squares policy iteration. The algorithm takes advantage of multi-core
or multi-machine architectures, where each worker (one per core or machine) individually
executes a fraction of the episodes and estimates a parameter while a consensus is maintained
by parameter averaging. With the feedback of the global consensus, each worker can access the
information learned by other workers at previous iterations. As a result, the learning
process of each individual worker can be accelerated compared to learning alone.

Our work aims to answer the following question: given multiple computational resources,
how can an MDP be solved efficiently? In our problem, each worker faces the same MDP and
communicates its estimated parameter to the others during learning. Thus, our
work can be regarded as a complement to multi-agent MDPs.

Our analysis of parallel LSPI shows that the correlation between the learning processes of
the individually learned models can influence the effectiveness of the method. The
computational gains achieved with parallel LSPI are less significant when the workers are highly
correlated. To deal with this issue, we propose a heuristic that encourages
each worker to explore more (i.e., to take random actions) when it collects samples, which
increases the randomness and in turn reduces the correlation.

To summarize, we propose parallel LSPI to efficiently solve an MDP through parallel
programming. Our method can also balance the communication overhead against the
number of iterations required to find the optimal solution, which suits situations where only
limited bandwidth is available. We give some analysis of the proposed method and conduct
experiments to show its effectiveness on queueing networks and persistent search and track
domains.


Method/Theory/Experiment:

Speeding  up  ADMM 


We  begin  by  describing  how  to  apply  ADMM  in  a  distributed  data  problem.  The  augmented 
Lagrangian  of  objective  (1)  is 


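$$
L_\rho(w, c, \lambda) = C \sum_{m=1}^{g} \sum_{i \in N_m} \max(1 - w_m^T z_i,\, 0)^2 + \frac{1}{2} c^T c + \sum_{m=1}^{g} \left[ \frac{\rho}{2} \| w_m - c \|^2 + \lambda_m^T (w_m - c) \right]. \tag{2}
$$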

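Each iteration of ADMM consists of

$$ w^{k+1} = \arg\min_{w} L_\rho(w, c^k, \lambda^k) \tag{3} $$

$$ c^{k+1} = \arg\min_{c} L_\rho(w^{k+1}, c, \lambda^k) \tag{4} $$

$$ \lambda_m^{k+1} = \lambda_m^k + \rho\,(w_m^{k+1} - c^{k+1}), \quad m = 1, \ldots, g \tag{5} $$

where k is the iteration index, $w$ is $\{w_1, \ldots, w_g\}$, and $\lambda$ is $\{\lambda_1, \ldots, \lambda_g\}$. Step (5)
is the usual ADMM multiplier update. Note that solving (3) is similar to solving the objective
of a standard SVM. To see this, we rewrite it in explicit form.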


$$
w_m^{k+1} = \arg\min_{w_m} \; C \sum_{i \in N_m} \max(1 - w_m^T z_i,\, 0)^2 + \frac{\rho}{2} \| w_m - c^k \|^2 + (\lambda_m^k)^T (w_m - c^k). \tag{6}
$$

Therefore, we can use existing optimization methods for SVMs to solve (6).
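The iteration (3)-(5) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: each $z_i$ is assumed to be stored as the label multiplied by the feature vector, the SVM solver for (6) is replaced by a fixed number of plain gradient steps, and the names `local_update` and `consensus_admm` as well as the step sizes are hypothetical. The c-update uses the closed form obtained by setting the gradient of (2) with respect to c to zero, and the multiplier step is the usual ADMM update.

```python
def sq_hinge_grad(w, data, C):
    """Gradient of C * sum_i max(1 - w.z_i, 0)^2 with respect to w."""
    grad = [0.0] * len(w)
    for z in data:
        margin = 1.0 - sum(wj * zj for wj, zj in zip(w, z))
        if margin > 0.0:
            for j, zj in enumerate(z):
                grad[j] -= 2.0 * C * margin * zj
    return grad


def local_update(w, data, c, lam, C, rho, steps=50, lr=0.01):
    """Approximate the w-update (6); gradient steps stand in for an SVM solver."""
    for _ in range(steps):
        grad = sq_hinge_grad(w, data, C)
        w = [wj - lr * (gj + rho * (wj - cj) + lj)
             for wj, gj, cj, lj in zip(w, grad, c, lam)]
    return w


def consensus_admm(parts, dim, C=1.0, rho=1.0, iters=30):
    """parts[m] holds the local vectors z_i of machine m (assumed: label * feature)."""
    g = len(parts)
    ws = [[0.0] * dim for _ in range(g)]
    lams = [[0.0] * dim for _ in range(g)]
    c = [0.0] * dim
    for _ in range(iters):
        # Step (3): each machine updates its own model (independent across machines).
        ws = [local_update(ws[m], parts[m], c, lams[m], C, rho) for m in range(g)]
        # Step (4): c = sum_m (rho * w_m + lam_m) / (1 + g * rho).
        c = [sum(rho * ws[m][j] + lams[m][j] for m in range(g)) / (1.0 + g * rho)
             for j in range(dim)]
        # Step (5): multiplier update lam_m += rho * (w_m - c).
        lams = [[lams[m][j] + rho * (ws[m][j] - c[j]) for j in range(dim)]
                for m in range(g)]
    return c, ws
```

Only the vectors `ws[m]`, `lams[m]`, and `c` would cross the network in a real deployment; the local data never leave their machine.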

A  sampling  approach  for  fast  ADMM  on  primal  objective  (1) 

From the previous section, we know that ADMM requires solving the SVM-like problem (6)
at every iteration, and each such solve requires many inner iterations to complete. When the
local data are large, this takes a substantial amount of time. To deal with this, we propose a
way to alleviate the training cost. During the earlier iterations, each local machine uses only a
subset of its data, instead of all the data, to update its learned model. As the algorithm
continues, each machine gradually increases the amount of data used until it reaches its
full local data. This method enjoys a fast decrease of the objective value in the
first few iterations, similar to SGD, and shares the same long-term convergence rate as using
the full dataset at every iteration. Thus, at each iteration, each machine solves the following
instead of (6):

$$
w_m^{k+1} = \arg\min_{w_m} \; C \sum_{i \in N_m^k} \max(1 - w_m^T z_i,\, 0)^2 + \frac{\rho}{2} \| w_m - c^k \|^2 + (\lambda_m^k)^T (w_m - c^k), \tag{7}
$$

where we have replaced $N_m$ with $N_m^k$, which represents the amount of data used on machine
m at the k-th iteration. The amount of data used at iteration k+1, $N_m^{k+1}$, satisfies

$$ N_m^{k+1} = \min\{\beta N_m^k,\; N_m\}, \tag{8} $$

where $\beta > 1$ is the growth rate. Note that $N_m$ stands for all of the data on machine m. For
example, we can initialize $N_m^1$ to $0.5\,N_m$ and set $\beta$ to 1.1, meaning that the amount of data used
for training grows by 10 percent at each ADMM iteration. The
optimization procedure of the modified algorithm is roughly the same as the traditional one; it
iterates over (3)-(5), except that the sub-problem (3), or equivalently (6), is replaced by (7)
with the sample size following (8). Problem (7) can be solved in the same manner as described
in the previous section.
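The growing-sample schedule can be illustrated with a short Python sketch. It assumes sample counts are rounded up to integers and that subsets are drawn uniformly without replacement; the function names and the default of ten iterations are hypothetical choices, not taken from this work.

```python
import math
import random


def sample_sizes(n_full, init_frac=0.5, beta=1.1, iters=10):
    """Number of local samples used at each ADMM iteration, capped at n_full."""
    sizes = []
    n_k = max(1, int(init_frac * n_full))
    for _ in range(iters):
        sizes.append(n_k)
        # Grow by the factor beta, but never beyond the full local data set.
        n_k = min(n_full, math.ceil(beta * n_k))
    return sizes


def draw_subset(local_data, n_k, rng=random):
    """Uniformly sample n_k local instances without replacement (an assumption)."""
    return rng.sample(local_data, n_k)
```

With 1000 local samples, an initial fraction of 0.5, and a growth factor of 1.1, the schedule starts at 500 samples and reaches the full data set within ten iterations.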

A  sampling  approach  for  fast  ADMM  on  dual  objective  of  (1) 

We give another method that still follows the idea of sampling a subset of the data in each round of
communication. This section begins by converting the primal form (1) to its dual form. To
make the dual form compact, we relax the constraint and approximate it. As each dual
variable corresponds to a sample, sampling a subset to update the model becomes easier and
more natural. We then show how to integrate the sampling idea into performing ADMM on the dual
of (1). Solving the dual of (1) by ADMM turns out to be equivalent to SDCA-ADMM. Finally, we
propose some techniques to perform SDCA-ADMM efficiently on distributed data.

Converting primal (1) to approximated dual form

To achieve this goal, we transform each feature vector from the p-dimensional feature space to a
pg-dimensional feature space. That is, the dimension of the augmented feature space is p x g,
which is g times larger than the original feature dimension p.

Let us denote the original feature vector of the i-th sample on the m-th machine as $\tilde{z}_{i,m}$.

The new feature vector $Z_{i,m}$ satisfies

$$ Z_{i,m}\big((m-1)p + 1 : mp\big) = \tilde{z}_{i,m}, $$

and the other entries in $Z_{i,m}$ are set to zero. Thus, the data matrix $Z$ is a block diagonal
matrix. Let us denote the diagonal blocks as $Z_{(m)}$, $m \in \{1, \ldots, g\}$.

Note that $Z_{(m)}$ is the submatrix of $Z$ that consists of the original features $\tilde{z}_{i,m}$.
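The feature augmentation amounts to copying the original vector into the block that belongs to its machine. A minimal Python sketch, in which the function name `augment` and the 1-based machine index are illustrative assumptions:

```python
def augment(feature, m, g):
    """Embed a p-dimensional feature from machine m (1-based) into p*g dimensions."""
    p = len(feature)
    out = [0.0] * (p * g)
    # Entries (m-1)p+1 .. mp (1-based) carry the original feature; the rest stay zero.
    out[(m - 1) * p: m * p] = feature
    return out
```

In practice the augmented vectors never need to be materialized, since only the non-zero block is used on each machine.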

We now specify a matrix $B$ that encodes the constraint in objective (1), which is

$$ w_1 = w_2 = \cdots = w_g. $$

Since the feature dimension becomes p x g, so does the dimension of the corresponding
classifier $w$. Denote by $w_{(m)}$ the m-th block of $w$. Owing to the augmented features we
design, we can view $w_{(m)}$ as the model learned by local machine m. Then, we specify the
transform $B$ to be

$$
B = \begin{pmatrix}
1_p & & & -1_p \\
-1_p & 1_p & & \\
 & \ddots & \ddots & \\
 & & -1_p & 1_p
\end{pmatrix} \in \mathbb{R}^{pg \times g},
$$

where $1_p$ denotes the p-dimensional vector of all ones. Thus, $B^T w = 0$ plays the role of the
constraint $w_1 = w_2 = \cdots = w_g$, which encourages the models associated with the machines to
agree with each other.


To make the dual form more compact, we relax the constraint and propose to
use a regularization term $\Psi$ so that $\Psi(B^T w)$ has a similar effect to the constraint.

We choose $\Psi$ to be the squared L2 norm. Thus, $\Psi(B^T w)$ is

$$ \Psi(B^T w) = \| B^T w \|_2^2 = \sum_{i=1}^{g} \big[ (B^T w)_i \big]^2, $$

where the i-th entry of $B^T w$ compares $w_{(i)}$ with $w_{(i+1)}$, and $w_{(g+1)} = w_{(1)}$. The
regularization penalizes the difference between a machine and its neighbors
(in terms of the index m) and encourages the subvectors $w_{(m)}$ of $w$ to agree with each other.
Note that the specification of the transform $B$ is flexible. If we have prior knowledge about the
relation between the models, we can easily encode it in the transform.

To summarize, the conversion to the approximated dual form is

$$
\min_{w} \; \sum_{m=1}^{g} \sum_{i \in N_m} f_{i,m}(Z_{i,m}^T w) + \Psi(B^T w)
= - \min_{x, y} \left\{ \sum_{m=1}^{g} \sum_{i \in N_m} f_{i,m}^*(x_{i,m}) + \Psi^*(y) \;\middle|\; Zx + By = 0 \right\}. \tag{11}
$$

Here $f_{i,m}$ is the loss function associated with sample i on machine m, and the symbol * is
used to represent the corresponding conjugate function. Note that x and y here are called dual
variables in the literature.
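As a worked instance of the conjugate function, consider the squared hinge loss of (1). The derivation below assumes $f_{i,m}(u) = C \max(1 - u, 0)^2$ with $u$ standing for the margin; this scaling is an assumption made for illustration.

```latex
% Assumption: f_{i,m}(u) = C \max(1 - u, 0)^2, the squared hinge loss of (1).
f_{i,m}^*(x) = \sup_{u} \left\{ x u - C \max(1 - u, 0)^2 \right\}

% For x > 0 the supremum is +\infty (let u \to \infty).
% For x \le 0 the maximizer is u = 1 + x / (2C), which gives
f_{i,m}^*(x) =
\begin{cases}
x + \dfrac{x^2}{4C}, & x \le 0, \\[4pt]
+\infty, & x > 0.
\end{cases}
```

The conjugate is a one-dimensional quadratic on a half-line, which is why updating a single dual coordinate is cheap.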

Exploiting diagonal structures of transformed feature space

We can now directly apply standard ADMM (e.g., (3)-(5)) to solve the dual problem (11). However,
since we have transferred the problem into a high-dimensional feature space, the computational
cost would be high. Fortunately, we can leverage the structure of the transformed feature
space. In the following, we give an example of how to compute $Z_I^T w$, where $Z$ is the
transformed data matrix, $I$ indexes the dual coordinates in the current mini-batch, and $w$ is the
concatenation of the classifiers computed by the local machines, whose dimension is pg. This
high-dimensional matrix-vector multiplication appears in the ADMM updates. Computing it
directly incurs heavy computation, large memory consumption, and very frequent communication.


Our example is the case g = 3 (i.e., the data are distributed over three machines). Suppose batches
$I_1$, $I_2$, and $I_3$ of dual coordinates are chosen at the current iteration, so the current
mini-batch is $I = \{I_1, I_2, I_3\}$. Since $Z$ is a block diagonal matrix and all the off-diagonal
blocks are zero (shown hatched in the figure), the computation can be decomposed into the
smaller components $Z_{(1),I_1}^T w_{(1)}$, $Z_{(2),I_2}^T w_{(2)}$, and $Z_{(3),I_3}^T w_{(3)}$, each of which is
computed independently on the respective local machine. Thus, unnecessary computation and
communication are avoided.
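The decomposition can be checked numerically with a small Python sketch. It compares the per-machine products against the same quantities computed with explicit pg-dimensional vectors; all function names are hypothetical and the data layout (one list of original features per machine) is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def local_products(blocks, w_blocks, batches):
    """Per-machine products: blocks[m][i] is an original p-dim feature on machine m,
    w_blocks[m] is the local model, batches[m] lists the chosen indices I_m."""
    return [[dot(blocks[m][i], w_blocks[m]) for i in batches[m]]
            for m in range(len(blocks))]


def full_products(blocks, w_blocks, batches):
    """Reference computation using explicit pg-dimensional augmented vectors."""
    g = len(blocks)
    p = len(w_blocks[0])
    w = [x for wb in w_blocks for x in wb]  # concatenated classifier
    out = []
    for m in range(g):
        row = []
        for i in batches[m]:
            z = [0.0] * (p * g)
            z[m * p:(m + 1) * p] = blocks[m][i]  # zero outside block m
            row.append(dot(z, w))
        out.append(row)
    return out
```

Both functions return the same values, but `local_products` touches only p numbers per sample and needs no data from other machines.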


[Figure: the product $Z_I^T w$ decomposed into the per-machine blocks $Z_{(1),I_1}^T w_{(1)}$, $Z_{(2),I_2}^T w_{(2)}$, and $Z_{(3),I_3}^T w_{(3)}$]

Fig. 1. An example illustrating how to leverage the structure of the transformed data matrix
to improve computing performance.


To summarize, our ADMM on the dual problem (11) is shown in Algorithm 1. The algorithm is a variant
of ADMM applied in the dual domain, combining the derivation above with the trick we just
described. By variant, we mean that in each iteration a batch of the dual coordinates x is updated
instead of all the coordinates, which is in the spirit of SDCA.

Algorithm 1 Distributed SDCA-ADMM
Input: parameters $\rho$, $\eta_Z$, $\eta_B$
Initialize $x_0 = 0$, $y_0 = 0$, $w_0 = 0$

for t = 1 to T do

1. The master receives $c_{(m)}$ from each local machine and updates

$$ y^{(t)} = q^{(t)} - \mathrm{prox}\big( q^{(t)} \mid n \Psi(\rho \eta_B \, \cdot) / (\rho \eta_B) \big). $$

2. The master broadcasts $y^{(t)}$.

3. Each local machine m randomly chooses a mini-batch $I_m$, so in total
$\sum_{m=1}^{g} |I_m|$ of the dual variables x are updated at iteration t.

4. Each local machine updates $x_{i,m}^{(t)}$, $i \in I_m$:

$$ x_{i,m}^{(t)} = \mathrm{prox}\big( r_{i,m}^{(t)} \mid \ldots \big) $$

for each $i \in I_m$, where $r_{i,m}^{(t)}$ is the i-th element of $r_{I_m}^{(t)}$, and

$$ r_{I_m}^{(t)} = x_{I_m}^{(t-1)} + \ldots $$

5. Each local machine updates $w_{(m)}^{(t)}$ from $w_{(m)}^{(t-1)}$,

and then computes $c_{(m)} = \ldots$ and sends it to the master.

end for

Fig. 2. Algorithm for operating on dual domain


Communication-Efficient  Online  Semi-Supervised  Learning  in  Client-Server  Settings 


Our  framework  is  designed  based  on  the  above  considerations.  It  can  be  decomposed  into 
several components that drive different functionalities. On the client side, we perform data
triage  by  selecting  instances  from  a  candidate  pool,  where  the  selection  criterion  is  controlled 
by  the  server.  On  the  server  side,  an  online  semi-supervised  learning  algorithm  is  employed  to 
handle  unlabeled  submissions.  The  key  is  to  maintain  two  learners — a  graph-based 
semi-supervised  model  and  a  linear  classifier — and  let  them  collaborate  to  exploit  unlabeled 
data.  Specifically,  incoming  instances  are  added  to  the  training  set  of  the  first  learner,  which  is 
represented  by  a  graph.  The  nodes  of  the  graph  are  instances,  and  the  edges  between  nodes 
reflect  the  similarity  between  the  corresponding  instances.  Then,  the  first  learner  predicts 
labels  for  all  unlabeled  instances  in  the  graph,  and  randomly  samples  an  instance  according  to 
the  confidence  of  its  predictions  in  order  to  teach  the  second  learner.  The  second  learner 
updates  its  hypothesis,  and  delivers  a  new  selection  criterion  to  the  client.  At  any  time,  the 
second  learner  can  be  used  as  a  standalone  model  for  predicting  new  test  data. 
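The client-side triage can be illustrated with a short Python sketch. It assumes that uncertainty is measured by the distance to the hyperplane defined by the weight vector received from the server; the diversity term is omitted, and `select_for_upload` and `budget` are hypothetical names.

```python
import math


def select_for_upload(pool, w, budget):
    """Pick the `budget` candidates closest to the hyperplane w.x = 0.

    pool:   list of feature vectors in the client's candidate pool
    w:      weight vector last received from the server
    budget: number of instances the client may send (hypothetical parameter)
    """
    norm = math.sqrt(sum(wj * wj for wj in w)) or 1.0

    def margin(x):
        # Unsigned distance to the decision hyperplane; small means uncertain.
        return abs(sum(wj * xj for wj, xj in zip(w, x))) / norm

    return sorted(pool, key=margin)[:budget]
```

Because only `w` is needed on the client, the selection criterion costs a single vector per update from the server.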

While different machine learning algorithms can be used as a part of this framework, some
techniques lend themselves to our problem setting better than others. In this work, we use the
harmonic solution (HS) as the first learner and the soft confidence-weighted classifier (SCW)
as the second learner. Our choice offers several advantages. First, SCW is simple, fast, and
enjoys state-of-the-art classification performance. Second, SCW performs conservative
updates, especially with noisy labels. Third, SCW can be parameterized by a weight vector and
a covariance matrix, allowing the server to deliver the selection criterion to the client at a
low communication cost. In this work, we simply transmit the weight vector of SCW to the
client. On the other hand, HS nicely complements SCW by providing feedback based on the data
manifold. It can leverage the similarities between instances, which SCW
overlooks, to determine the labels of unlabeled data. By pairing these two models, we
enjoy the best of both worlds: efficient learning and simple parameterization due to SCW, and
the ability to exploit the manifold information disclosed by unlabeled examples due to HS.
Moreover, SCW and HS can be incorporated into a single optimization problem.

One may question whether a two-learner structure is really preferable
to a single learner. For example, one alternative is to train a linear classifier
using its own predicted labels without leveraging data manifold information. Unfortunately,
such an idea is not effective according to our experiments. Sometimes, the results are even
worse than not using any unlabeled data. The reason is twofold. First, a single unlabeled
instance can hardly provide any useful information. Second, most online linear
classifiers return only a single hypothesis on each round, precluding any other possible
hypotheses. Hence, some previous work employed Bayesian methods to update a (posterior)
distribution over the hypotheses. Unfortunately, the posterior is often complicated, and it is not
known how to perform the update analytically. Therefore, the learning process can easily be
misled and become stuck in a wrong direction. Another alternative is to use a graph-based method
alone. However, due to the nonparametric nature of graph-based methods, it is not
straightforward to deliver the server's model to clients at a low communication cost (for
the same reason, nonparametric methods are not favorable in our problem setting). Moreover,
graph-based methods are also less efficient for predicting new data, as they usually involve
matrix inversion. A two-learner structure, in contrast, surmounts the above problems because
each learner compensates for the other's drawbacks. The choice of two learners with different
underlying mechanisms is key to good performance.

If we define the communication cost as the total number of vectors in $\mathbb{R}^d$ transmitted over the
network, then a straightforward implementation of our proposed framework incurs a
cost of at most the sum of two terms, one for client-to-server traffic and one for
server-to-client traffic.


Parallel least-squares policy iteration

We propose parallel least-squares policy iteration to handle the large-scale learning
problem. The setting is that there are M workers available for computation (either multiple
machines or multiple cores on a single machine). To fully exploit the available
computational resources, each worker m collects samples and runs by itself, and then updates
its estimates $A_m^{-1}$, $b_m$, and $\theta_m$. At some point during learning, it communicates the
learned $\theta_m$ to the other workers.

The algorithm is shown in Algorithm 1 in [b3]. For every outer iteration t, each worker m
individually collects samples by following an $\epsilon$-greedy policy over K episodes. While collecting
samples, each worker also incrementally updates $A_m$ and $b_m$. After conducting K episodes
of learning, each worker also reuses the samples collected at the previous iteration to update
$A_m$ and $b_m$. Then, each worker updates $\theta_m$ and sends it to the master. The master then averages
the models to obtain the consensus $z$ and broadcasts it to all the workers. The workers then
update the policy with the new consensus and proceed to the next iteration. After T iterations,
parallel LSPI outputs the most recent consensus $z_T$.
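The outer loop described above can be sketched in plain Python. The sketch only captures the communication pattern: each worker is modeled as a callable that stands in for running K episodes and solving for its local parameter, and the names `average` and `parallel_outer_loop` are hypothetical.

```python
def average(thetas):
    """Master step: coordinate-wise mean of the workers' parameter vectors."""
    count = len(thetas)
    return [sum(theta[j] for theta in thetas) / count
            for j in range(len(thetas[0]))]


def parallel_outer_loop(workers, theta0, T):
    """Run T outer iterations of local updates followed by parameter averaging.

    workers: callables (consensus, t) -> local parameter vector; each one is a
             stand-in for K episodes of sample collection plus a local solve.
    theta0:  initial consensus.
    """
    z = list(theta0)
    for t in range(T):
        # In a real system these calls run concurrently, one per core or machine.
        thetas = [run(z, t) for run in workers]
        # One round of communication: workers send, the master averages and broadcasts.
        z = average(thetas)
    return z
```

Each outer iteration costs one vector per worker in each direction, regardless of how many episodes the workers run in between.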

In parallel LSPI, each worker m communicates with the master only after updating its estimator
$\theta_m$, which occurs once it has executed a sufficient number of episodes. This strategy
balances the communication overhead against the number of iterations required to reach the
optimal solution. If the algorithm required each worker to communicate right after every episode,
the communication overhead would become heavy. If each worker ran its episodes independently
during training and the parameters were averaged only at the end, the communication overhead
would be minimized, but the results might not be satisfactory because the information from other
workers is completely ignored during training. Thus, a better strategy is to strike a balance
between the two extremes. Compared to the related work on parallel TD, which only allows small K
(roughly K < 5), the proposed parallel LSPI can reduce the communication cost by allowing the
workers to run a sufficient number of episodes before mixing. Note that the underlying core of
LSPI is least-squares temporal difference learning (LSTD_Q), which is naturally a batch method, in
contrast to TD learning. Therefore, it does not need to update $\theta$ for every transition as TD
learning does. Thus, parallel LSPI naturally enjoys the benefit of parallelization without the
burden of frequent communication.

ANALYSIS 

Here we analyze the sample complexity of the proposed method. As in the analysis of
non-parallel LSPI, the analysis is first performed on a version of LSTD called pathwise-LSTD
for policy evaluation, which analyzes LSTD at the states along a sampled trajectory
following a given policy. As there are M workers (and M trajectories) with parameter
averaging in our case, we have to analyze the averaged estimated parameters
from the trajectories. One may then generalize the analysis to the entire state space
under certain conditions and in turn derive the finite-sample bound of parallel LSPI.
Thus, here we focus on parallel pathwise-LSTD, as the insight into the sample
complexity can already be seen in this step.

Before explaining pathwise-LSTD, let us first describe the notation. As an MDP
reduces to a Markov chain given a policy $\pi$, let $(X_1, X_2, \ldots, X_n)$ be a trajectory of size n
generated by the Markov chain. With abuse of notation, we denote
$\Phi = [\phi(X_1)^T; \ldots; \phi(X_n)^T]$ as the feature matrix defined along the trajectory. The estimated
value function is thus constrained to the feature space $F = \{\Phi\theta,\ \theta \in \mathbb{R}^d\}$.
Pathwise-LSTD takes the feature matrix $\Phi$ generated by a trajectory following $\pi$ as input.
It builds the empirical transition matrix $\hat{P}_{ij} = \mathbb{I}\{j = i + 1,\ i \neq n\}$, and sets
the quantities $A = \Phi^T (I - \gamma \hat{P}) \Phi$ and $b = \Phi^T r$, with discount factor $\gamma$ and reward
vector $r$ along the trajectory. It then outputs the minimum-norm solution $\hat{\theta} = A^{+} b$,
where $A^{+}$ represents the Moore-Penrose pseudo-inverse of $A$.
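A minimal Python sketch of pathwise-LSTD on a single trajectory follows. It assumes that the empirical transition matrix maps each state on the trajectory to its successor and has a zero last row, and that A is nonsingular so the pseudo-inverse reduces to an ordinary linear solve; the helper `solve` and all argument names are illustrative.

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting; assumes A is nonsingular."""
    d = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(d):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][d] / M[i][i] for i in range(d)]


def pathwise_lstd(phis, rewards, gamma):
    """phis[t]: feature vector of the t-th state; rewards[t]: reward observed there."""
    n, d = len(phis), len(phis[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for t in range(n):
        # The successor of the last state is treated as zero (zero row of P_hat).
        nxt = phis[t + 1] if t + 1 < n else [0.0] * d
        diff = [phis[t][j] - gamma * nxt[j] for j in range(d)]
        for i in range(d):
            b[i] += phis[t][i] * rewards[t]            # b = Phi^T r
            for j in range(d):
                A[i][j] += phis[t][i] * diff[j]        # A = Phi^T (I - gamma P_hat) Phi
    return solve(A, b)
```

In the parallel setting, each worker would run this routine on its own trajectory and the master would average the resulting parameter vectors.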


Let us denote $v$ and $\hat{v}$ as the value function and its estimate along the
trajectory, and

$\|f\|_n^2 = \frac{1}{n} \sum_{t=1}^{n} f(X_t)^2$

as the empirical norm. Moreover, let $V_{\max}$ represent the maximum of the value
function, $\Pi$ be the projection onto the feature space, and $\nu$ be the smallest
positive eigenvalue of the Gram matrix $\Phi^\top \Phi / n$. From Theorem 1,
$\|v - \hat{v}\|_n$ is bounded as

$\|v - \hat{v}\|_n \le \ldots$

with high probability $1 - \delta$.


For parallel pathwise-LSTD, denote $(X_{1,m}, X_{2,m}, \ldots, X_{n,m})$ as the $m$-th
trajectory of the Markov chain induced by a policy $\pi$,
$\Phi_m = [\phi(X_{1,m})^\top; \ldots; \phi(X_{n,m})^\top]$ as the corresponding feature
matrix, and $\nu_m$ as the smallest positive eigenvalue of the corresponding sample-based
Gram matrix $\Phi_m^\top \Phi_m / n$ of features. Moreover, let $\theta_m$ represent the
pathwise solution of trajectory $m$, and let $v_m$ and $\hat{v}_m$ represent the value
function and the estimated one at the states along trajectory $m$, respectively. Since
parallel LSPI conducts parameter averaging, we are interested in the sample complexity
associated with the averaged estimator $\bar{\theta} = \frac{1}{M} \sum_{m=1}^{M} \theta_m$.
Let $\pi'$ represent the new policy at the next iteration of parallel LSPI and
$X'_1, X'_2, \ldots, X'_n$ be the trajectory of the Markov chain following $\pi'$, with
$\Phi'$ being the corresponding feature matrix. Then, we want to analyze the quantity
$\|v' - \Phi'\bar{\theta}\|_n$, as it would give us insight on the sample complexity. To
estimate the upper bound, we make the following assumption, which connects the
performance of $\theta_m$ evaluated at the new trajectory $\{X'_t\}_{t=1}^{n}$ to the one
evaluated at the original trajectory $\{X_{t,m}\}_{t=1}^{n}$ from which $\theta_m$ is
estimated.

Assumption There exists a constant $c$ such that the empirical norm of the difference
between the value function $v'$ and the one implied by $\theta_m$, evaluated at the
trajectory $\{X'_t\}_{t=1}^{n}$, is upper bounded as
$\|v' - \Phi'\theta_m\|_n \le c\, \|v_m - \Phi_m\theta_m\|_n$.

Notice that the empirical norms on both sides of the inequality are evaluated at
different trajectories; the left hand side is on $\{X'_t\}_{t=1}^{n}$, while the right
hand side is on $\{X_{t,m}\}_{t=1}^{n}$ for an $m$. Using the assumption, we can bound
$\|v' - \Phi'\bar{\theta}\|_n$, which measures the performance of the averaged estimator.

Proposition Following the above assumptions, if $(v' - \Phi'\theta_m)$ of each $m$ is near
orthogonal, we have

where $\nu_{\min}$ represents the smallest of the $\nu_m$, and
$\max_m \|v_m - \Pi_m v_m\|_n$ represents the largest of the terms
$\|v_m - \Pi_m v_m\|_n$. If the $(v' - \Phi'\theta_m)$ are highly correlated, we have

The proposition suggests that, in the ideal case, parallel LSTD can estimate the
value function at an improved rate of $O(1/\sqrt{Mn})$ for each worker, compared to the
original rate of $O(1/\sqrt{n})$. This implies that a process in which $M$ workers each
collect about $n/M$ samples in parallel is comparable in effectiveness to a single worker
collecting $n$ samples; the difference is that the sampling in parallel LSTD is conducted
in a distributed fashion. At the other extreme, in the worst case, parallel LSTD has the
same rate, $O(1/\sqrt{n})$, for each worker as standard LSTD, even though each worker
individually collects $n$ samples, meaning that the total number of samples in parallel
LSTD is $M$ times larger than in the non-parallel one.
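
The two regimes can be illustrated with a toy simulation. This is only a sketch under
simplifying assumptions and not the experiment reported here: the per-worker error vectors
are replaced by synthetic Gaussian vectors, independent draws stand in for the nearly
orthogonal case, and identical copies stand in for the highly correlated case. All names
are hypothetical.

```python
import math
import random


def empirical_norm(vec):
    """Empirical norm: sqrt((1/n) * sum of squared entries)."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def averaged_error_norm(errors):
    """Empirical norm of the average of M per-worker error vectors."""
    m, n = len(errors), len(errors[0])
    avg = [sum(e[t] for e in errors) / m for t in range(n)]
    return empirical_norm(avg)


rng = random.Random(0)
M, n = 8, 20000

# Nearly orthogonal case: each worker has an independent error vector.
independent = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]

# Highly correlated case: every worker makes exactly the same error.
shared = [rng.gauss(0.0, 1.0) for _ in range(n)]
correlated = [shared[:] for _ in range(M)]

# Error of the averaged estimator relative to a single worker's error.
ratio_independent = averaged_error_norm(independent) / empirical_norm(independent[0])
ratio_correlated = averaged_error_norm(correlated) / empirical_norm(correlated[0])
```

In this toy setting the first ratio comes out close to $1/\sqrt{M}$, while the second stays
at 1, which mirrors the gap between the ideal and the worst case.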


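Proof: First, let us rewrite $v' - \Phi'\bar{\theta}$, where
$\bar{\theta} = \frac{1}{M}\sum_{m=1}^{M}\theta_m$ is the averaged estimator, as
$\frac{1}{M} \sum_{m=1}^{M} (v' - \Phi'\theta_m)$. Suppose that the $(v' - \Phi'\theta_m)$
are nearly mutually orthogonal. Then, we have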
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(15) 


where 


G —  - - —  M(lXm{  ||um 


V  1  —  7: 


8log(2d/6 )  |  1 


(16) 
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Taking  square  root  on  both  sides  gives  us  the  result. 


In contrast, if all $(v' - \Phi'\theta_m)$ are highly correlated, we have


which  reduces  to  the  case  of  a  single  worker. 

To achieve better performance for parallel LSPI, according to the proposition, we
should make the $(v' - \Phi'\theta_m)$ as uncorrelated with each other as possible.
However, $(v' - \Phi'\theta_m)$ is unknown in advance. To deal with this issue, a
heuristic is used that makes each worker take random actions more often when collecting
samples. Since $\epsilon$-greedy is adopted here, a larger $\epsilon$, which encourages
more exploration (i.e. randomly choosing an action), will bring us closer to the goal. If
$\epsilon$ gets close to zero, the randomness would mostly come from the transitions $P$
of the process. In this case, parallel LSPI may not achieve a significant speedup,
depending on the underlying MDP and the feature space. Note that the degree of
correlation between the $\theta_m$ is not equivalent to the degree of correlation between
the $(v' - \Phi'\theta_m)$. Still, our experiments reveal that such a heuristic does have
a positive effect on the performance.
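
For reference, $\epsilon$-greedy action selection can be sketched as follows. This is a
generic illustration rather than the code used in the experiments; the function name, the
list of action values, and the convention that larger values are better are assumptions
(for a loss-based domain the greedy choice would be the minimizer instead).

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Take a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest estimated value (assumed "higher is better").
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q = [0.1, 0.9, 0.3]  # hypothetical state-action values for three actions
greedy_picks = [epsilon_greedy(q, 0.0, rng) for _ in range(100)]
explore_picks = [epsilon_greedy(q, 1.0, rng) for _ in range(1000)]
```

With $\epsilon = 0$ every worker follows the same greedy action, whereas a larger $\epsilon$
lets the workers' trajectories diverge, which is the effect the heuristic relies on.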

Results  and  Discussion: 

Speeding  up  ADMM 

We compare our methods with standard ADMM and distributed SDCA on several datasets.
We denote our first method, which performs ADMM on the primal objective (1) with sampling,
as ADMM-P, and we denote the second method, which performs distributed SDCA-ADMM on the
dual objective (9), as ADMM-D.

All datasets are binary classification problems. The datasets delta, gamma, and ocr are
available at http://largescale.ml.tu-berlin.de/about, while mnist, epsilon, and rcv1 are
available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Table 1 shows their
statistics. For delta, gamma, and ocr, because only the labels of the original training data
are available, we split each of them into 80% for training and the remaining 20% for testing.
For mnist, we choose classifying digit 4 against digit 7 as the recognition goal and also
split the data 80/20. For epsilon, we use the pre-defined training/testing split. For rcv1,
the ratio of the original training to testing data size is much less than one, so we re-split
the data 80/20. The data are further distributed on four machines in our workstation. The
batch size $|I|$ is set to 100 (i.e. each of the four machines uses 25 samples at a time) for
both SDCA and ADMM-D on all the datasets except rcv1, where the batch size $|I|$ is set to
1000.


Data      number of samples   dimension   Data      number of samples   dimension
delta     500,000             500         ocr       3,500,000           1,156
gamma     500,000             500         epsilon   500,000             2,000
mnist47   1,634,445           784         rcv1      697,641             47,236

Table 1. Dataset Statistics

There are some parameters that need to be specified in ADMM-D: $\rho$, $\gamma$,
$\eta_{z,i}$, and $\eta_b$. For $\gamma$, we set $\gamma = 1/n$. For $\eta_b$, due to the
choice of transform $B$, a simple calculation shows that it should be at least $gp$ (i.e. no
less than the product of the number of machines $g$ and the feature dimension $p$). For
$\eta_{z,i}$, it should be larger than the largest eigenvalue of $Z_I^\top Z_I$ associated
with the mini-batch $I$. But it is time consuming to compute it for each mini-batch in the
augmented feature space. Instead, before running the optimization, we just sample a
mini-batch, say $I$, and calculate the largest eigenvalue of $Z_I^\top Z_I$ in the original
feature space. Denote the computed value as $\eta_{tmp}$. We set every $\eta_{z,i}$ to the
same value: $\eta_{z,i} = \theta$, where $\theta = 5 \times 10^{d}$ with the smallest power
$d$ such that $\theta > \eta_{tmp}$. For $\rho$, it is set to $10^{d'}$ with $d'$ chosen to
satisfy $\rho \times \eta_{z,i} = 5$. We found that this heuristic works well.
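
The rounding rule above can be sketched in a few lines. This is a minimal illustration
under stated assumptions: the eigenvalue estimate is passed in as a plain number (how it is
computed is left open), $d$ is taken to be an integer, and the function name
`pick_step_sizes` is hypothetical.

```python
def pick_step_sizes(eta_tmp):
    """Return (eta, rho, d) following the 5 x 10^d rounding heuristic.

    eta_tmp: largest eigenvalue estimated on one sampled mini-batch
             (a hypothetical input for this sketch).
    """
    if eta_tmp <= 0:
        raise ValueError("eta_tmp must be positive")
    d = 0
    # Move down while the next smaller power would still exceed eta_tmp ...
    while 5 * 10.0 ** (d - 1) > eta_tmp:
        d -= 1
    # ... and up until 5 * 10^d is strictly larger than eta_tmp.
    while 5 * 10.0 ** d <= eta_tmp:
        d += 1
    eta = 5 * 10.0 ** d
    rho = 10.0 ** (-d)  # chosen so that rho * eta = 5
    return eta, rho, d


eta, rho, d = pick_step_sizes(37.0)  # 37.0 is a made-up eigenvalue estimate
```

For the made-up estimate of 37, the rule rounds up to 50 and pairs it with a penalty of
0.1, so that their product is 5.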

The parameters for standard distributed ADMM are set to the default setting, while for
ADMM-P we set the additional parameter $k$ to 1.5, which means that the number of samples
used grows by a factor of 1.5 at each subsequent iteration (we did not find any value that
leads to significantly better results). Since the objective of each method is not the same
(for ADMM-D, the objective is shown in (9); for standard ADMM and ADMM-P, the objective is
shown in (1); for parallel SDCA, the objective is hinge loss with L2 regularization without
any constraint, unlike the others), we tune the regularization parameter C such that each
method achieves its best accuracy on each dataset.
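
To make the role of $k$ concrete, the following sketch lists the subset sizes that a
growth factor of 1.5 would produce over a few rounds. The initial subset size, the cap at
the local data size, and the function name are assumptions for illustration; the text above
only fixes the growth factor.

```python
def sample_schedule(n_total, n_init, k=1.5, rounds=10):
    """Subset sizes per round: start at n_init, multiply by k, cap at n_total."""
    sizes = []
    size = float(n_init)
    for _ in range(rounds):
        sizes.append(min(int(size), n_total))
        size *= k
    return sizes


# Hypothetical local data size of 1000 and initial subset of 100 samples.
schedule = sample_schedule(1000, 100, k=1.5, rounds=8)
```

Under these assumed settings the subset covers the full local data after a handful of
rounds, so only the early, cheaper rounds operate on partial data.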
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Fig.  2  shows  the  results,  which  are  about  training  accuracy  vs.  time.  Each  curve  represents  a 
different method. For ADMM or ADMM-P, each point on a curve is the classification result
of the model at the corresponding iteration, while for distributed SDCA or ADMM-D, each point
is the result after a number of iterations equal to the number of batches (which is roughly
equivalent to a full pass over the data). We run distributed SDCA and ADMM-D ten times for
each dataset, so their results are averages.


Fig. 2. Training accuracy vs. time for each method on different datasets (panels include
ocr, epsilon, and rcv1).


From the figures, we see that ADMM-D converges the fastest among the methods on most of
the datasets, the exception being rcv1. The figure also shows that ADMM-D and distributed
SDCA outperform standard ADMM and ADMM-P in most cases, while ADMM-D is better than
distributed SDCA on gamma and epsilon. On the other data, ADMM-D is comparable to or
slightly better than distributed SDCA. This indicates that ADMM-D (distributed SDCA-ADMM)
can enjoy the merits of both SDCA and ADMM and outperform them as a consequence. The figure
also shows that ADMM-P is at least comparable to standard ADMM. On rcv1, standard ADMM seems
to be better than ADMM-D and distributed SDCA. Note that the feature dimension of rcv1 is
much larger than that of the other datasets. As the dimension increases, the cost of
computations such as matrix-vector products also scales with it. Standard ADMM follows a
strategy used in LIBSVM, where it maintains and updates an active set such that the dual
variables associated with samples outside the active set do not need to be updated anymore.
This means that the computational time decreases over the iterations of standard ADMM, which
alleviates the cost of high dimensionality. In SDCA or ADMM-D, in contrast, each iteration
still samples a batch of pre-determined size. Thus, an interesting direction for future work
is incorporating the active set strategy into distributed SDCA and ADMM-D. To summarize,
ADMM-D is an effective method on data of medium feature size. If frequent communication is
admissible (as in our setup), we suggest using ADMM-D; otherwise, use ADMM-P.

DISTRIBUTION A. Approved for public release: distribution unlimited.

We propose sampling-based ADMM approaches for learning from distributed data. We
integrate the idea of stochastic gradient descent into ADMM. Our first method uses a subset
of the data in early rounds of communication, which reduces the cost of the early stage while
enjoying a convergence rate similar to that of the standard method. We also transform the
primal objective into an approximate dual form and propose a distributed variant of the
recently proposed SDCA-ADMM to solve it.

Communication-Efficient Online Semi-Supervised Learning

Experiments were conducted on seven data sets downloaded from either the UCI ML
repository (wearable, skin) or the LIBSVM website (mushroom, mnist, webspam, gisette,
ijcnn1). The motion recognition data set wearable and the digit recognition data set mnist
were each converted into a set of binary problems in which each class is discriminated
against every other class. In total, we produced 10 problems from wearable and 45 from mnist.
For each data set, we balanced the number of instances of each class and linearly rescaled the
feature values into the range [-1, 1].

We  evaluated  the  algorithms  using  a  set  of  trials  with  different  partitions  of  the  training  and 
test  data.  In  each  trial,  we  randomly  held  out  half  of  the  data  for  testing;  all  instances  in  the 
test  set  were  labeled  by  the  algorithms.  The  remaining  data  was  used  for  training,  of  which 
only a small amount was labeled. Both training and test sets were class-balanced. Next, we
randomly permuted the training data and kept labeled data always at the beginning. All
algorithms were then incrementally trained with the same permutation in each trial. For
evaluation, we paused the training at regular intervals, computed the output hypothesis so far,
and calculated its test accuracy.

Fig. 3. Test accuracy of different models on the server. The x-axis represents the number of
unlabeled instances on the client. The origin corresponds to the point where the initial 2% of
the data has been labeled and learned and the first unlabeled instance comes in. The client
randomly selected 10 instances from every 50 instances. full is an idealized approach in which
an oracle labels all selected instances. none does not upload any unlabeled instance to the
server, so the corresponding test accuracy is constant. A tournament result between algorithms
on all problems of mnist and wearable is presented in Table 2.


Comparison  of  Server’s  Model 

Fig. 3 shows the results. It can be observed that the proposed hs+scw and hs+scw+cut enjoy
superior performance on 8 out of 10 problems compared to the other partial-label competitors.
On 45 mnist problems, hs+scw and hs+scw+cut yielded on average 0.966 and 0.971 accuracy,
respectively. On 20 wearable problems, hs+scw and hs+scw+cut gave 0.699 and 0.714
accuracy, respectively. They are consistently better than the single-learner counterpart scw
on all data sets. This indicates the effectiveness of leveraging the manifold information of
the graph. In fact, on webspam, ijcnn1 and wearable, scw is even worse than none. On webspam,
its test accuracy starts at 0.658, decreases over time, and finally reaches 0.637. This is due
to the fact that scw relies completely on its own predictions for learning. When the labeling
rate is small, the initial hypothesis constructed from labeled data may not be accurate
enough. As a consequence, the prediction of scw on


Comparison  of  Selection  Strategy 

Fixing the model on the server as hs+scw+cut, we study the following strategies on the client
side.

All  All unlabeled instances are uploaded without selection. This incurs 5x the
communication cost of the other approaches.

rand  Instances are randomly selected for uploading.

certain  The most certain instances according to the current server model are uploaded.

uncertain  The most uncertain instances are uploaded.

submod  Selection is done by optimizing the submodular function described in Section VI. It
simultaneously considers uncertainty and redundancy.
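
The simpler strategies listed above can be sketched as follows. This is an illustration
under assumptions, not the implementation evaluated here: the client is assumed to hold a
signed margin from the current server model for each instance, the absolute margin is used
as a proxy for certainty, and the function name `select` is hypothetical. The submod
strategy is omitted because its objective is defined elsewhere in the report.

```python
import random


def select(scores, k, strategy, rng=None):
    """Return sorted indices of the instances to upload from one window.

    scores: signed margins under the current server model (assumed to be
            available on the client); |score| serves as a certainty proxy.
    """
    idx = list(range(len(scores)))
    if strategy == "all":
        return idx
    if strategy == "rand":
        return sorted((rng or random.Random()).sample(idx, k))
    if strategy == "certain":
        return sorted(sorted(idx, key=lambda i: -abs(scores[i]))[:k])
    if strategy == "uncertain":
        return sorted(sorted(idx, key=lambda i: abs(scores[i]))[:k])
    raise ValueError("unknown strategy: " + strategy)


window = [2.0, -0.1, 0.4, -3.0, 0.05]  # made-up margins for five instances
most_certain = select(window, 2, "certain")
most_uncertain = select(window, 2, "uncertain")
random_pick = select(window, 2, "rand", random.Random(0))
```

In the setting described in the Fig. 3 caption, such a routine would be called with a
window of 50 instances and k = 10.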
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TABLE  2  Performance  comparison  of  different  selection 

strategies. The value in the table denotes the number of times that the row algorithm
achieved significantly better accuracy than the column algorithm under the t-test with
p = 0.05.

(a) Different server's models on mnist (45 problems).
(b) Different server's models on wearable (10 problems).
(c) Different selection strategies on mnist (45 problems).
(d) Different selection strategies on wearable (10 problems).
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The  results  are  shown  in  Table  2.  It  is  interesting  to  see  that  All,  which  transmits  all  unlabeled 
data,  does  not  lead  to  a  better  performance.  In  fact,  on  mnist,  mushroom,  and  gisette,  All 
yields  worse  test  accuracy  compared  to  selective  transmission.  This  confirms  the  intuition  that 
not  all  unlabeled  instances  are  useful.  It  also  suggests  the  necessity  of  using  a  selective 
sampling strategy on the client. Not only can communication costs be saved, but a better
model might also be learned.
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Parallel  least-squares  policy  iteration 

We conduct the experiments on two domains. One is the discrete-time four-dimensional
queuing network. Figure 4 illustrates the network, which includes four queues, each with
buffer size B. Here server 1 can only serve queue 1 or 4, and server 2 can only serve queue 2
or 3, one at a time but not simultaneously. Each server can handle at most one customer at a
time. Moreover, neither server can be idle. Let the tuple $\{a_1, a_2, a_3, a_4\}$ represent
an action combination which the servers take by considering the conditions in the queues
$\{q_1, q_2, q_3, q_4\}$, where $a_i \in \{1, 0\}$ indicates whether $q_i$ is currently being
served or not. Then, there are four actions in total that the servers can take: {1,1,0,0},
{0,1,0,1}, {1,0,1,0}, and {0,0,1,1}. As a result, the number of state-action pairs is
$(1+B)^4 \times 4$, which means that a modest B will result in a huge state space.

The dynamics of the network are defined by the rate parameters
$p_1, p_3, d_1, d_2, d_3, d_4 \in (0, 1)$, all of which parameterize Bernoulli distributions.
$p_1$ and $p_3$ are the arrival rates of new customers: at each time step, with probability
$p_i$, a new customer comes to queue $i$. $d_i$ is defined as follows: if $a_i = 1$, which
indicates queue $i$ is being served, the server succeeds in handling a customer with
probability $d_i$ before the next time step, and fails with probability $1 - d_i$. Starting
with empty queues, each episode spans a fixed number of time steps, T. The goal is to minimize
the average total number of waiting (unserved) customers in the network during an episode. The
loss for a state-action pair is defined as $l(s,a) = l(s) = |X|$, which is the total number of
unserved customers in all the queues. After an episode, the network is reset to empty and a
new episode begins.

Fig. 4. The discrete-time four-dimensional queuing network with server 1 and server 2.
Customers can arrive at q1 or q3. A customer that has been served at q1/q3 is then passed on
to q2/q4.

We  consider  four  types  of  networks: 

1) p1 = 0.5, p3 = 0.5, d1 = 0.5, d2 = 0.8, d3 = 0.8, d4 = 0.5, episode duration T = 50, and
buffer size B = 5, which results in 5,184 state-action pairs.

2) p1 = 0.5, p3 = 0.8, d1 = 0.5, d2 = 0.1, d3 = 0.8, d4 = 0.8, episode duration T = 100, and
buffer size B = 10, which results in 58,564 state-action pairs.

3) p1 = 0.4, p3 = 0.4, d1 = 0.5, d2 = 0.8, d3 = 0.3, d4 = 0.3, episode duration T = 200, and
buffer size B = 15, which results in 262,144 state-action pairs.

4) p1 = 0.4, p3 = 0.4, d1 = 0.4, d2 = 0.8, d3 = 0.8, d4 = 0.4, episode duration T = 200, and
buffer size B = 15, which results in 262,144 state-action pairs.
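
The state-action counts quoted for the four networks follow from $(1+B)^4 \times 4$, since
each of the four queues holds between 0 and B customers and there are four joint actions. A
short check, with a hypothetical helper name:

```python
def num_state_action_pairs(buffer_size, num_queues=4, num_actions=4):
    """Each queue holds 0..B customers; every state pairs with each joint action."""
    return (1 + buffer_size) ** num_queues * num_actions


# Buffer sizes used by the four network types above.
sizes = {B: num_state_action_pairs(B) for B in (5, 10, 15)}
# sizes == {5: 5184, 10: 58564, 15: 262144}
```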

We design a 340-dimensional sparse binary feature for the type 1 network and
1048-dimensional features for the others. In our design, only two entries in the feature are
non-zero for each state-action pair.


The other domain is persistent search and track. In this scenario, three Unmanned Aerial
Vehicles (UAVs) cooperate on a mission. There are three available actions for each UAV:
{advance, retreat, loiter}, resulting in 27 possible action combinations in total. The
current state of a UAV is described by its location, fuel, actuator status, and camera status.
The goal is to fly to the target site and perform surveillance, while ensuring that there is a
UAV with a working actuator loitering at the intermediary site to transfer the information on
the targets to the base.


We modify the scenario because the reported performance of LSPI is not good in the original
setting. Each UAV starts from the base with 6 units of fuel. The camera and actuator of each
UAV may fail with a 3% probability at each time step. The camera cannot function with a
failed actuator, so a UAV with a failed actuator cannot perform surveillance. Yet, a UAV can
perform communication even if its camera malfunctions. A successful surveillance mission
must have at least one UAV with a working actuator at the intermediary site, and at least one
UAV with a working actuator and camera at the surveillance site. At each time step, each UAV
loses 1 unit of fuel except when it “loiters” at the base or at the intermediary site. When a
UAV “loiters” at the base, the failed camera and actuator are fixed, and the fuel is fully
recharged. When a UAV with a working actuator “loiters” at the intermediary site, the
messages are transmitted to the base and the UAV’s fuel tank is recharged by 2 units (the fuel
cannot exceed the capacity, which is 6 units). If a UAV “retreats” at the base, then it may
“advance” to the intermediary site or “loiter” at the base with equal probability. Executing
the “advance” action at the surveillance site has a similar effect. The reward for each
state-action pair is defined as
$r(s,a) = 15 \times I_{comm} \times I_{surv} - 10 \times I_{crash} - (18 - \text{total remaining fuel})$,
where $I_{comm}$ indicates whether there is a UAV with a working actuator at the intermediary
site, $I_{surv}$ indicates whether there is a UAV with a working actuator and camera at the
surveillance site, and $I_{crash}$ indicates whether a UAV crashes. If a UAV runs out of fuel,
which means it crashes, the episode is terminated. In total, there are about
$1.6 \times 10^6$ state-action pairs, and the feature is an approximately 2000-dimensional
sparse binary vector including a fixed sparse representation.

Fig. 5. Persistent search and track.
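
The reward definition for the persistent search and track domain can be written out
directly. The sketch below is only an illustration: the function name and argument names
are hypothetical, and the constant 18 is read as the total fuel capacity of the three UAVs
(3 x 6 units).

```python
def pst_reward(comm_ok, surv_ok, crashed, total_fuel, max_total_fuel=18):
    """r(s, a) = 15 * I_comm * I_surv - 10 * I_crash - (18 - total remaining fuel).

    comm_ok:    a UAV with a working actuator is at the intermediary site.
    surv_ok:    a UAV with working actuator and camera is at the surveillance site.
    crashed:    some UAV has run out of fuel.
    total_fuel: sum of the remaining fuel over all UAVs.
    """
    bonus = 15 * int(comm_ok) * int(surv_ok)
    penalty = 10 * int(crashed)
    return bonus - penalty - (max_total_fuel - total_fuel)


best_case = pst_reward(True, True, False, 18)
no_surveillance = pst_reward(True, False, False, 12)
crash_case = pst_reward(False, False, True, 10)
```

The product form means that the bonus of 15 is only earned when the communication relay and
the surveillance UAV are in place at the same time.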

B.  Setup  and  Results 

We compare parallel LSPI with standard LSPI on the two domains. Both methods are
implemented in C, and the communication in parallel LSPI is implemented with MPI. For
parallel LSPI, we report the performance with M = 4 and M = 8 workers. In our implementation,
parallel LSPI uses single-core machines, yet a shared-memory architecture (i.e. multiple
cores on a single machine) is also applicable. For the queuing network domain, LSPI and
parallel LSPI update their policies every 100 episodes, and both of them terminate after
learning 1000 episodes, which corresponds to K = 100 and T = 10 in the algorithm. For
persistent search and track, both LSPI and parallel LSPI update their policies every 1000
episodes, and terminate after 10,000 episodes. We set $\gamma = 0.95$ for both domains. All
the experiments are repeated for 50 runs, and the averaged results and standard deviations
are reported.

We also try different $\epsilon$ for the $\epsilon$-greedy policy. A higher $\epsilon$ means
each worker takes a random action with higher probability, which could reduce the correlation
between workers. The results for the queuing network are shown in Figure 6, and the results
for persistent search and track are shown in Figure 7. The learned parameter $\theta$ is
recorded when it is updated, so each point on a line represents the performance of the learned
parameter at the end of an iteration of the corresponding algorithm. For parallel LSPI, we
record the consensus as the learned parameter. The performance of a learned parameter in the
queuing network domain is measured by the average loss (where the loss of an episode is
defined as the average number of waiting customers in the network during the episode) over
500 additional episodes, which are conducted by following the deterministic policy implied by
the learned parameter. In the persistent search and track domain, the performance is measured
by cumulative discounted rewards, with the same evaluation procedure as for the queueing
network.
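
The two evaluation measures described above can be sketched as follows. The function names
and the data layout (one list of per-step values per episode) are assumptions made for
illustration; the discount of 0.95 is the value stated in the setup.

```python
def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward of one episode: sum_t gamma^t * r_t."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


def average_episode_loss(episodes):
    """Mean over episodes of the per-episode average number of waiting customers.

    episodes: list of lists, one list of per-step waiting counts per episode.
    """
    per_episode = [sum(ep) / len(ep) for ep in episodes]
    return sum(per_episode) / len(per_episode)


# Tiny made-up examples.
ret = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
avg_loss = average_episode_loss([[0, 2], [4, 4]])
```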

Figure 6 suggests that, for the queueing network, a higher $\epsilon$ yields better
parallelization since the correlation between workers is smaller. At a higher $\epsilon$
(left column), which encourages taking random actions more often during learning,
parallel LSPI can significantly accelerate the learning process compared to taking
actions more greedily with respect to the current estimated state-action value (right
column). For parallel LSPI with 0.7- or 0.5-greedy, after running three or four
iterations, it already reaches the point that standard LSPI needs ten iterations or
more to reach. For 0.2-greedy in networks 3 and 4, parallel LSPI has no advantage while
consuming more computational resources per iteration than the non-parallelized one.

For the persistent search and track domain, parallel LSPI with $\epsilon = 0.1$ already
achieves a significant speedup over its non-parallel counterpart; increasing $\epsilon$
can accelerate it further, but not by much. This reflects the limitation of the proposed
heuristic, yet parallel LSPI still shows the benefit of parallelization. The overhead
of parallelization can also be seen in the figures. We can see that parallel LSPI
requires slightly more time than standard LSPI to finish the same number of iterations,
due to communication overhead. Yet, this overhead is tolerable since parallel LSPI
usually reaches the same level of performance as standard LSPI with far fewer
iterations.

Fig. 6. Losses of the learned parameter versus learning time (s). Each row corresponds to a
network, while each column corresponds to a different $\epsilon$: panels (a)-(c) show 0.7-,
0.5-, and 0.2-greedy in the type 1 network, (d)-(f) in the type 2 network, (g)-(i) in the
type 3 network, and (j)-(l) in the type 4 network. Star markers represent LSPI, circle
markers represent parallel LSPI with four workers, and square markers represent parallel
LSPI with eight workers. The 0.5 s.e. error bars are also plotted.

From the figures, we also observe that the advantage of parallelization decreases as
the number of workers increases (4 workers vs. 8 workers). The acceleration gained by
doubling the workers is incremental in most cases. A possible explanation is that the
correlation between workers is likely to increase as more workers are added. Similar to
parallel TD, the degree of parallelization of our method still has some room for
improvement. This limitation is not explicitly implied by our analysis since we only
show the ideal case and the worst case. That is, the connection between the degree of
possible parallelization and the properties of the underlying MDP and feature space
needs to be further explored. We leave it as future work.

Fig. 7. Rewards of the learned parameter in the persistent search and track domain versus
learning time, for (a) 0.1-greedy, (b) 0.2-greedy, and (c) 0.35-greedy. The 0.5 s.e. error
bars are also plotted.